BUG: read_csv adding additional columns as integers instead of strings #47137

phofl · 2022-05-27T09:58:41Z

closes BUG: read_csv(index_col=False) -> auto-generated column headers for variable row length get headers of type int #46997
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

All column names are interpreted as strings, even if they are numeric. So the auto-generated (bogus) names for the extra columns have to be strings as well, to be consistent.
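To illustrate the point above, here is a minimal pure-Python sketch of why the type of the generated names matters once they reach a `usecols` callable. It is not the pandas implementation: `pad_header`, `select_columns`, and `keep_a_and_b` are hypothetical helpers, and using the column position as the placeholder name is an assumption made for the example.

```python
from typing import Callable, List


def pad_header(header: List[str], n_columns: int, as_str: bool = True) -> list:
    """Extend `header` with placeholder names so it covers `n_columns` columns.

    Hypothetical helper for illustration only, not pandas code.
    """
    extra = range(len(header), n_columns)
    return list(header) + [str(i) if as_str else i for i in extra]


def select_columns(names: list, usecols: Callable[[str], bool]) -> List[int]:
    """Return the positions of the names accepted by the `usecols` callable."""
    return [pos for pos, name in enumerate(names) if usecols(name)]


def keep_a_and_b(name: str) -> bool:
    # A typical user callable: it assumes every name is a string.
    return name.strip() in {"a", "b"}


# String placeholders: the callable only ever sees strings and works.
names = pad_header(["a", "b"], n_columns=3)
selected = select_columns(names, keep_a_and_b)

# Integer placeholders: the callable fails on the first generated name,
# because integers have no `.strip()` method.
try:
    select_columns(pad_header(["a", "b"], n_columns=3, as_str=False), keep_a_and_b)
    int_error = None
except AttributeError as exc:
    int_error = exc
```

With string placeholders the callable simply rejects the generated name; with integer placeholders it raises before any selection happens, which is the kind of failure reported in the linked issue.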

simonjayhawkins · 2022-05-27T11:07:01Z

pandas/_libs/parsers.pyx

@@ -1304,8 +1304,10 @@ cdef class TextReader:
            if self.header is not None:
                j = i - self.leading_cols
                # generate extra (bogus) headers if there are more columns than headers
+                # These should be strings, not integers, because otherwise we might get
+                # issues with callables as usecols GH#46997


if these are truly bogus, why are they even passed to the usecols callable? There are no circumstances where these "auto-generated" column names are/should be in the result?

see #47138, this is the bogus case

You can access them when using an index-based usecols setting, but the column name is shifted afterwards.

Edit: If we want to change this (I think I would be in favor of that), we have to deprecate first

but the column name is shifted afterwards.

right, so if they don't currently make their way into the result, we are not at risk of changing existing behavior with this PR?

No.

At least I can not see a way in how this would impact anything.

You can not select these columns by name via usecols. If you select them by position, an available name is used, not the bogus name.

The only impact this should have is consistency when feeding them into the usecols callable.

At least I can not see a way in how this would impact anything.

cool.

simonjayhawkins · 2022-05-27T12:02:57Z

doc/source/whatsnew/v1.5.0.rst

@@ -809,6 +809,7 @@ I/O
 - Bug in :func:`read_parquet` when ``engine="pyarrow"`` which caused partial write to disk when column of unsupported datatype was passed (:issue:`44914`)
 - Bug in :func:`DataFrame.to_excel` and :class:`ExcelWriter` would raise when writing an empty DataFrame to a ``.ods`` file (:issue:`45793`)
 - Bug in :func:`read_html` where elements surrounding ``<br>`` were joined without a space between them (:issue:`29528`)
+- Bug in :func:`read_csv` adding columns as integers instead of string when data is longer than header leading to issue with ``usecols`` (:issue:`46997`)


can you reword so others are not confused like I was. "adding columns as integers" is misleading?

Done, but this is tricky. Technically this is an implementation detail that should not leak into the outside world...

jreback

lgtm ex @simonjayhawkins comments

jreback

lgtm merge on green

BUG: read_csv adding additional columns as integers instead of strings

ef898b4

phofl added IO CSV read_csv, to_csv Bug labels May 27, 2022

simonjayhawkins reviewed May 27, 2022

jreback added this to the 1.5 milestone May 27, 2022

jreback approved these changes May 27, 2022

Reword whatsnew

6559309

jreback approved these changes May 27, 2022

phofl merged commit c9ce063 into pandas-dev:main May 27, 2022

phofl deleted the 46997 branch May 27, 2022 14:18
